A Reproducible Data Workflow



McCrea Cobb

Refuge Inventory and Monitoring
mccrea_cobb@fws.gov
mccrea-cobb

Emma Schillerstrom

Refuge Inventory and Monitoring
emma_shillerstrom@fws.gov
eschillerstrom-usfws

Jonah Withers

Fisheries and Ecological Services
jonah_withers@fws.gov
JonahWithers-USFWS



Analysis

Analysis

Criteria for Best Practice

  • The steps are clear
  • The workflow is reproducible
  • Limited opportunities for human error

Analysis: Manual Workflow

Steps

  1. Reformat observation data
  2. Get spatial covariates
  3. Load data into software (PRESENCE, Mark, Distance, etc.)
  4. Use interface to select options
  5. Run it and export results

Compare Workflows

Manual

  • Restricted to functions in the software


  • “Black Box”
  • Difficult/time-consuming to document and reproduce steps
  • Must extract results into another software to visualize

Scripted

  • Many custom R packages for performing most common ecological data analyses
  • R packages are generally well-documented
  • Your code documents your steps


  • Self-contained workflow

Witch Survey

Analysis

Get Covariates

# Refuge boundary
source("./R/spatial_helpers.R")
tetlin <- get_refuge("Tetlin National Wildlife Refuge")

# NLCD layer
library(FedData)
get_nlcd(tetlin, label = "tetlin", year = 2016, landmass = "AK")

Calculate Distances

library(terra)

# Calculate distance to forest
forest <- terra::segregate(nlcd, classes = 42)  # Extract the forest layer
forest <- terra::classify(forest, 
                          rcl = matrix(c(1, 0, 1, NA), 
                                       nrow = 2, 
                                       ncol = 2))  # Reclassify 0 as NA
forest <- terra::distance(forest)
forest <- project(forest, "EPSG: 4326")  # Reproject to WGS84
forest <- mask(crop(forest, ext(tetlin)), tetlin)
names(forest) <- "forest"

# Calculate distance to water
water <- terra::segregate(nlcd, classes = 11)  # Extract the water layer
water <- terra::classify(water, rcl = matrix(c(1, 0, 1, NA), 
                                             nrow = 2, 
                                             ncol = 2))  # Reclassify 0 as NA
water <- terra::distance(water)
water <- project(water, "EPSG: 4326")  # Reproject to WGS84
water <- mask(crop(water, ext(tetlin)), tetlin)
names(water) <- "water"

Extract Covariates to Sites

library(sf)
library(terra)

sites <- data.frame(sf::st_coordinates(sites),
                    forest = terra::extract(forest, sites)$forest,
                    water = terra::extract(water, sites)$water)
X Y forest water y.1 y.2 y.3 y.4 y.5 y.6 y.7 y.8
-141.3999 62.7348 0.00 947.74 0 0 0 0 0 0 0 0
-141.2218 62.6009 0.00 1720.93 0 0 0 0 0 0 0 1
-141.1138 62.4666 67.79 1540.31 0 0 0 0 0 0 0 0
-141.8892 62.6574 43.63 0.74 1 0 1 0 0 1 0 1
-141.2041 62.5027 6.02 443.13 0 0 1 1 0 1 0 0
-142.0298 63.0688 0.00 765.78 0 0 0 0 0 0 0 0
-141.6778 62.5465 3.49 2245.14 0 0 0 0 0 0 0 0
-141.0126 62.4304 0.00 170.52 1 0 0 1 1 0 0 0
-141.8592 62.7298 4.57 1055.73 0 0 0 0 0 0 0 0
-141.7650 62.7891 46.97 137.62 0 0 0 0 0 0 0 0
-141.3865 62.6102 320.58 374.49 0 0 0 0 0 0 0 0
-142.4754 62.6560 0.16 1272.86 1 0 1 0 0 0 1 1

Single-Season Occupancy Model

# Required package
library(unmarked)

# Create an unmarked data frame
unmarked_df <- unmarkedFrameOccu(y = y, siteCovs = site_covs)
sc <- scale(site_covs)
siteCovs(unmarked_df) <- sc

# Fit a single-season occupancy model
mod <- unmarked::occu(formula = ~forest ~ water + 
                                forest, 
                        data = unmarked_df[[1]])

# Look at the estimates
mod@estimates
Occupancy:
            Estimate    SE      z P(>|z|)
(Intercept)   0.0923 0.213  0.433  0.6651
water         0.8643 0.337  2.568  0.0102
forest       -0.4465 0.300 -1.487  0.1370

Detection:
            Estimate    SE     z P(>|z|)
(Intercept)  -0.0507 0.115 -0.44  0.6596
forest        0.7406 0.325  2.28  0.0226

Summarize Results

Summarize Results

Criteria for Best Practice

  • Customizable
  • Updateable
  • Standardized

Summarize Results: Manual Workflow

Steps

  1. Import results and data into Excel
  2. Create summary tables and plots
  3. Cut/paste into Word doc
  4. Reformat style to match document
  5. (Repeat…)

A funny image downplaying on our fears of AI by showing a mistake in Excel

https://www.reddit.com/r/ProgrammerHumor/comments/fiw1rw/excel/

Summarize Results: Compare Workflows

Manual

  • Introduce human error
  • Limited plotting function
  • Static output

Scripted

  • Self-contained workflow

Witch Survey

Results

Witch Survey Results

Occupancy

# Load site data and scale them
load("./data/rdata/unmarked_df.Rdata")

sc <- read.csv("./data/csv/sites.csv") |>
    dplyr::select(forest, water) |>
    scale()

# Fit single season occupancy model
mod <- fit_model(unmarked_df)

# Calculate predicted values
pred <- predict_occ(mod, sc)

# Plot predicted values (psi)
ggplot(pred, aes(x = covariateValue, 
                 y = Predicted), 
  group = covariate) + 
  geom_line() +
  geom_ribbon(aes(ymin = lower, 
                  ymax = upper), 
              linetype = 2, 
              alpha = 0.1) +
  xlab("Distance (m)") +
  ylab("Psi") +
  theme_fws() +
  facet_grid(~covariate, scales = "free")

 

Witch Survey Results

Witch Survey Results

Detection



# Plot predicted values (detection)
ggplot(pred, aes(x = covariateValue, 
                 y = Predicted)) +
         geom_line() +
         geom_ribbon(aes(ymin = lower, 
                         ymax = upper), 
                     linetype = 2, 
                     alpha = 0.1) +
         xlab("Distance (m)") +
         ylab("Detection (p)") +
         theme_fws()

Witch Survey Results

#' Create a leaflet map of occupancy within a refuge
#'
#' @param ras a `terra::ras` raster of psi estimates
#' @param s  an `sf::st_point` of the sites surveyed
#' @param r an `sf` multipolygon of the refuge boundary
#' @para p whether to map the predicted value of psi ("Predicted") or SEs ("SE")
#' @param h the height of the leaflet map returned
#' @param w the width of the leaflet map returned
#'
#' @return a leaflet map
#'
#' @import RColorBrewer
#' @import leaflet
#' @import terra
#' 
#' @export
#'
#' @example 
#' \dontrun{
#' create_map(ras = psi, s = sites, r = tetlin, p = "Predicted", h = 650, w = 300)
#' }

create_map <- function(ras, 
                       s, 
                       r,
                       p = "Predicted",
                       h = NULL,
                       w = NULL) {
  if (p == "Predicted") {
    x <- c(round(minmax(ras)[[1,1]],2), 
           round(minmax(ras)[[2,1]],2))
    grp <- "Psi"
    ras <- ras$Predicted
  } else if (p == "SE") {
    x <- c(round(minmax(ras)[[1,2]],2), 
           round(minmax(ras)[[2,2]],2))
    grp <-"SE"
    ras <- ras$SE
  }
  
    
  pal_rev <- colorNumeric(RColorBrewer::brewer.pal(5, "Spectral"), 
                          x, 
                          reverse = TRUE, 
                          na.color = "#00000000")
  pal <- colorNumeric(RColorBrewer::brewer.pal(5, "Spectral"), 
                      x)
  
  leaflet(height = h, width = w) |> 
    addTiles() |>
    addRasterImage(ras, 
                   colors = pal_rev, 
                   maxBytes = 10168580, 
                   opacity = 0.75, 
                   group = grp) |>
    addCircleMarkers(data = s, lat = ~Y, lng = ~X, 
                     radius = 0.5, 
                     color = "black", 
                     group = "sites") |>
    addPolygons(data = r,
                fill = FALSE,
                color = "black",
                group = "Tetlin",
                weight = 0.5) |>
    addLayersControl(overlayGroups = c(grp, 
                                       "sites", 
                                       "Tetlin"),
                     options = layersControlOptions(collapsed = FALSE)) |>
    addLegend(pal = pal,
              values = x,
              title = grp,
              labFormat = labelFormat(transform = function(x) sort(x, decreasing = TRUE))) |>
    addMiniMap(height = 100, 
               width = 100) |>
    addScaleBar()
}



base_map <- function(s, r, h = NULL, w = NULL) {
  leaflet(height = h, width = w) |> 
    addTiles() |>
    addCircleMarkers(data = s, lat = ~Y, lng = ~X, 
                     radius = 0.5, 
                     color = "black", 
                     group = "sites") |>
    addPolygons(data = r,
                fill = FALSE,
                color = "black",
                group = "Tetlin",
                weight = 0.5) |>
    addLayersControl(overlayGroups = c("sites", 
                                       "Tetlin"),
                     options = layersControlOptions(collapsed = FALSE)) |>
    addMiniMap(height = 100, 
               width = 100) |>
    addScaleBar()
}

Witch Occupancy (\(\psi\))

# Source the `create_map()` function
source("./R/create_map.R")

# Import data
tetlin <- sf::st_read("data/shapefile/tetlin.shp", 
                      quiet = TRUE)
sites <- read.csv("data/csv/sites.csv")
psi <- terra::rast("data/raster/psi/psi.tif")

# Create a map
create_map(ras = psi, 
           s = sites, 
           r = tetlin, 
           h = 650,
           w = 300)

Precision of Estimates (SE)

# Source the `create_map()` function
source("./R/create_map.R")

# Import data
tetlin <- sf::st_read("data/shapefile/tetlin.shp", 
                      quiet = TRUE)
sites <- read.csv("data/csv/sites.csv")
psi <- terra::rast("data/raster/psi/psi.tif")

# Create a map
create_map(ras = psi, 
           s = sites, 
           r = tetlin, 
           p = "SE",
           h = 650,
           w = 300)

Report

Report

Criteria for Best Practice

  • Easy to update
  • Clear link between the data and the report
  • Reproducible

Report: Manual Workflow

Steps

  1. Copy/paste tables and figures into Word
  2. Calculate inline statistics and add into doc
  3. Update formatting to look good
  4. Repeat steps for PowerPoint presentation

Reporting in R: Quarto

Reporting in R: Shiny

Witch Survey

Report

Witch Report

```{r}
---
title: "Tetlin Witch Report"
author: Jane Biologist
format: html
fig-align: center
editor: source
---


```{{r setup}}
#| echo: false
#| message: false

knitr::opts_chunk$set(warning = FALSE, 
                      echo = FALSE,
                      message = FALSE, 
                      fig.retina = 3, 
                      fig.align = "center")
library(unmarked)
library(terra)
library(tidyverse)
library(RColorBrewer)
library(sf)
library(leaflet)
```

```{{r load_data}}
#| cache: true

# Load site data
dat <- read.csv("data/csv/dat.csv")

source("R/simulate_data.R")
source("R/create_map.R")

# Scaled covariates
sc <- dat |>
    dplyr::select(forest, water) |>
    scale()

# Load site data and scale them
load("./data/rdata/unmarked_df.Rdata")
```

```{{r fit_model}}
#|cache: true

# Fit single season occupancy model
mod <- fit_model(unmarked_df)
```

## Introduction

Invasive witches have become a management concern at Tetlin National Wildlife Refuge. As such, there is a need to estimate witch occurrence within the Refuge.

## Methods

### Data Collection

We visited a sample randomly distributed sites across Tetlin Refuge. At each site, we spent one hour looking and listening for witches. We revisited each site eight times.

### Model

We estimated witch occupancy and detection using a single-season occupancy model. We used the `unmarked` R package. [blah, blah, blah]


## Results

We surveyed a total of `r nrow(dat)` sites. The average distance to water at our sites was `r round(mean(dat$water), 2)` m. The average distance to forest at our sites was `r round(mean(dat$forest), 2)` m.

```{{r}}
#| out-width: "50%"
#| fig-cap: "A map of the sites surveyed for witches, Tetlin National Wildlife Refuge, Alaska."

# Import data
tetlin <- sf::st_read("data/shapefile/tetlin.shp", 
                      quiet = TRUE)
sites <- read.csv("data/csv/sites.csv")

# Create leaflet map
base_map(sites, tetlin)
```

We observed witches on `r sum(dat[6:13])` of 800 site visits, for a naive occupancy of `r round(sum(dat[6:13])/ncell(dat[6:13]), 2)`. 

```{{r plot_psi}}
#| fig-height: 3
#| fig-cap: Occupancy of witches at Tetlin National Wildlife Refuge, Alaska, 2024.

# Calculate predicted values (unscaled)
pred <- rbind(plotEffectsData(mod, "state", "forest"), plotEffectsData(mod, "state", "water")) %>%
  mutate(covariateValue = case_when(
    covariate == "forest" ~ covariateValue * attr(sc, 'scaled:scale')[[1]] + attr(sc, 'scaled:center')[[1]],
    covariate == "water" ~ covariateValue * attr(sc, 'scaled:scale')[[2]] + attr(sc, 'scaled:center')[[2]]
  ))

# Plot predicted values (psi)
ggplot(pred, aes(x = covariateValue, y = Predicted), 
  group = covariate) + 
  geom_line() +
  geom_ribbon(aes(ymin = lower, ymax = upper), 
              linetype = 2, 
              alpha = 0.1) +
  xlab("Distance (m)") +
  ylab("Psi") +
  facet_grid(~covariate, scales = "free")
```

```

Pretty Witch Report

Pretty Witch Report

Sharing Survey Products

Summary

Questions